Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: use progress bar when paginating #3532

Merged
merged 7 commits into from
Mar 28, 2020
Merged

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Mar 25, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • During size estimation, status is displayed via BAR_FMT_NOTOTAL progress bar, as current # of estimated total remote size
  • During threaded traverse, status is displayed via a single progress bar, as # of objects fetched out of the estimated total remote size

Will close #3526

@pmrowla pmrowla self-assigned this Mar 25, 2020
dvc/remote/oss.py Outdated Show resolved Hide resolved
@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 25, 2020

for gdrive remote w/~300k-400k files:
asciicast

@shcheklein
Copy link
Member

minor: NO_TOTAL progress bar still looks very similar to the regular one, I first thought there is a bug - wondering if this format could be improved to make it more explicit? cc @casperdcl

@shcheklein
Copy link
Member

awesome stuff, btw, @pmrowla ... thanks for taking care of UI here!

Copy link
Contributor

@casperdcl casperdcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, er, progress!

dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Show resolved Hide resolved
dvc/remote/gdrive.py Outdated Show resolved Hide resolved
dvc/remote/gs.py Outdated Show resolved Hide resolved
dvc/remote/ssh/__init__.py Show resolved Hide resolved
tests/unit/remote/test_base.py Outdated Show resolved Hide resolved
@@ -691,7 +691,7 @@ def path_to_checksum(self, path):
def checksum_to_path_info(self, checksum):
return self.path_info / checksum[0:2] / checksum[2:]

def list_cache_paths(self, prefix=None):
def list_cache_paths(self, prefix=None, progress_callback=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

progress_callback is unused in ssh and local remotes, but the parameter was added to both so that overridden list_cache_paths() in ssh/local still match the abstract method defined in RemoteBASE.list_cache_paths()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but presumably it should be used now that it's added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both local and ssh have their own overridden cache_exists() implementation and neither one uses list_cache_paths() at all during cache_exists(). The only time list_cache_paths() would get called for either local or ssh remotes would be for RemoteBASE.all() during a dvc gc call.

all() does not currently display any progressbars at all, but now that I'm thinking about this, it probably should

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed; though I'm not sure in principle about having a function exists which takes a non-null progress_callback and ignores it. This implies a bar will be displaying hanging at 0% on screen and then eventually suddenly disappear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_cache_paths() for local and ssh have been updated to use progress_bar (if it is set) to keep them consistent with the other remotes. A separate issue (#3543) has been filed for updating the dvc gc/RemoteBASE.all() behavior and will be addressed in a follow-up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw just to clarify - every time I've said "unused?" it's more a reminder to address at some point (could be in a future PR) rather than an expectation to address in this PR.

dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/s3.py Outdated Show resolved Hide resolved
dvc/remote/hdfs.py Outdated Show resolved Hide resolved
@casperdcl casperdcl added enhancement Enhances DVC ui user interface / interaction labels Mar 26, 2020
pmrowla and others added 6 commits March 27, 2020 12:17
- add `progress_callback` parameter to `list_cache_paths()` so that
  remotes can update remote cache traverse progress bar after fetching a
  page
- keep ssh/local consistent with other remotes (even though `dvc
  gc`/`RemoteBASE.all()` do not currently use progress bars)
@pmrowla pmrowla changed the title [WIP] remote: use progress bar when paginating remote: use progress bar when paginating Mar 27, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 27, 2020

asciicast

The estimation progress bar still looks similar to the remote cache query progress bar, but given that the estimation status is just calculating the initial and total values for the cache query progress bar, it seems ok to me.

I'm not sure if it would make much of a difference to the user whether the estimation and main cache query appears to be a single operation or as two explicitly distinct operations.

@pmrowla pmrowla marked this pull request as ready for review March 27, 2020 03:57
@shcheklein
Copy link
Member

The estimation progress bar still looks similar to the remote cache query progress bar

do you mean that it looks like an actual progress bar but in this case we don't show any progress?

@shcheklein
Copy link
Member

from what I see from code we do show actual progress now, right?

so, what is your concern, Peter? they are similar, and it looks totally fine to me ... may be I'm missing something.

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 27, 2020

from what I see from code we do show actual progress now, right?

so, what is your concern, Peter? they are similar, and it looks totally fine to me ... may be I'm missing something.

My concern was only that it still hasn't really changed since what you commented on for the original video.

I think what you are noting as "showing actual progress now" is more a result of the performance differences between s3 and gdrive. For reference, the current code w/gdrive looks like:
asciicast

@shcheklein
Copy link
Member

yes, I see, it doesn't indeed show the progress itself - it's impossible since we don't know what the size is, right? we'll be able to do this though when we implement the short cutting in your other PR? at least we'll have an upper bound of how many page we want to go through

is there some other quick style we can apply - like spinner or something?

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 27, 2020

Yes, after the short-cutting PR we would have an upper-bound for the push/pull cache size estimation step and could show progress that way. However for dvc gc/RemoteBASE.all() (#3543) which always needs to fetch all of the remote cache entries, we would be back to having no upper-bound on the size estimation step

@shcheklein
Copy link
Member

@pmrowla make sense, yep.

@casperdcl is there a spinner-like mode for TQDM? :)

It looks like it's a bit more general problem that we should probably solve in a separate PR? looks like we just need a UX component for such cases, most likely we'll have more applications for it. @pmrowla @efiop what do you think?

I believe there are lot of project we can even use (though we wan't be able to show units in a nice way?) - https://github.com/pavdmyt/yaspin. That's why it would be great to have a mode like this in TQDM itself.

Comment on lines 858 to 863
self.list_cache_paths(
prefix=prefix,
progress_callback=lambda n=1: pbar.update(
n * total_prefixes
),
),
Copy link
Contributor

@efiop efiop Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would extract this into a separate var and use it instead. There are too many levels of brackets in this set(map(... line , which makes it really hard to reat it πŸ™‚ Same down below.

Comment on lines 937 to 941
list(
self.list_cache_paths(
prefix=prefix, progress_callback=pbar.update
)
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move list into a separate variable and then map(self.path_to_checksum, paths). Just to make it easier to read πŸ™‚

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! @pmrowla please check my two nitpicky comments above and feel free to merge. The questions discussed earlier could indeed be solved later with the followup PR, I agree with @shcheklein . I'm eager to unblock #3537 πŸ™‚

@pmrowla pmrowla merged commit acdc876 into iterative:master Mar 28, 2020
@pmrowla pmrowla deleted the 3526 branch March 28, 2020 03:52
@casperdcl
Copy link
Contributor

@shcheklein there's no spinner mode but I think the "no total" progress stats are much better than a spinner. I guess you'd like both the stats and a spinner? The updating stats function as a spinner to me :)

@shcheklein
Copy link
Member

@casperdcl yep, I agree that it's definitely not critical! ... like I mentioned, the only reason for my concern was that since they are so similar and happen one after another, you kind-a expect to see progress happening (visual part of it - where it's emptiness now) - but when you don't see it you start asking yourself a question - is still at 0% 😱 :)

Maybe just let it align the stats close to the text? To make it more explicit that visual part is not expected.

@casperdcl
Copy link
Contributor

casperdcl commented Mar 28, 2020

no strong opinions here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC ui user interface / interaction
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remote: use progress bar when paginating
5 participants